BUGfix: Fix image_grid_thw IndexError in GRPOTrainer with Multimodal Models (Qwen3-VL) due to None Values in Chat Content #5364
Cursor Bugbot has reviewed your changes and found 2 potential issues.
cleaned_item = remove_empty_fields(item)
cleaned_inputs.append(cleaned_item)
prompts.append(cleaned_item["prompt"])
inputs = cleaned_inputs
Broad None stripping removes top-level image keys breaking detection
High Severity
remove_empty_fields is applied to the entire input dict, not just the prompt content blocks. This strips top-level keys with None values, including "image". When inputs[0] has "image": None (a text-only sample in a mixed batch) but other inputs have actual images, the key is removed from inputs[0]. The subsequent check "image" in inputs[0] then fails, causing images = None and silently losing all images in the batch. The fix should only clean the nested prompt content, not the entire input dict.
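The scoped cleaning suggested above could look roughly like the following. This is a minimal sketch, not the PR's code: `clean_prompt_content` is a hypothetical helper name, and it assumes each sample is a dict with a "prompt" list of messages whose "content" is either a string or a list of block dicts.

```python
from typing import Any


def clean_prompt_content(example: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `example` whose prompt content blocks have no None-valued keys.

    Top-level keys (for example "image") are left untouched, even when they are None.
    """
    cleaned_prompt = []
    for message in example["prompt"]:
        content = message["content"]
        if isinstance(content, list):
            # Drop None-valued keys inside each content block only.
            content = [
                {k: v for k, v in block.items() if v is not None}
                for block in content
            ]
        cleaned_prompt.append({**message, "content": content})
    return {**example, "prompt": cleaned_prompt}


# A text-only sample as it might appear in a mixed batch.
text_only = {
    "prompt": [
        {"role": "user", "content": [{"type": "text", "text": "Hi", "image": None}]}
    ],
    "image": None,
}
cleaned = clean_prompt_content(text_only)
print("image" in cleaned)  # True: the top-level key survives
print(cleaned["prompt"][0]["content"])  # [{'type': 'text', 'text': 'Hi'}]
```

Because the top-level "image" key is preserved, a later check such as "image" in inputs[0] behaves the same as before the cleaning step.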
inputs[0]["image"] should be a PIL image and will not be removed by my changes.
print(inputs)
[{'prompt': [{'content': [{'image': None, 'text': 'You are good at step by step reasoning.', 'type': 'text'}], 'role': 'system'}, {'content': [{'image': '../datas/VisuRiddles/images/sichuan/2021_59.png', 'text': None, 'type': 'image'}, {'image': None, 'text': '[Logical Reasoning] \nThe left image shows the unfolded surface of a cube-shaped box. Which option can be folded into the cube depicted?option: A,B,C,D\nWrite the answer into a JSON form\njson\n{"answer": "X"}', 'type': 'text'}], 'role': 'user'}], 'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=560x168 at 0xFFFBE5FEB5D0>, 'metadatas': {'gold_answer': 'A'}}, {'prompt': [{'content': [{'image': None, 'text': 'You are good at step by step reasoning.', 'type': 'text'}], 'role': 'system'}, {'content': [{'image': '../datas/VisuRiddles/images/sichuan/2021_59.png', 'text': None, 'type': 'image'}, {'image': None, 'text': '[Logical Reasoning] \nThe left image shows the unfolded surface of a cube-shaped box. Which option can be folded into the cube depicted?option: A,B,C,D\nWrite the answer into a JSON form\njson\n{"answer": "X"}', 'type': 'text'}], 'role': 'user'}], 'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=560x168 at 0xFFFBE5FEBC90>, 'metadatas': {'gold_answer': 'A'}}]
cleaned_item = remove_empty_fields(item)
cleaned_inputs.append(cleaned_item)
prompts.append(cleaned_item["prompt"])
inputs = cleaned_inputs
Fix not propagated to RLOO trainer's duplicated code
Medium Severity
The remove_empty_fields logic was added only to grpo_trainer.py but not to rloo_trainer.py, which has the same duplicated _generate_and_score_completions method with the identical prompts = [x["prompt"] for x in inputs] pattern. Per project rules, changes to duplicated logic across trainers must be applied consistently to all copies.
Triggered by project rule: BUGBOT.md
I'm not sure whether the subsequent execution flow and call stack of rloo_trainer.py are exactly the same as in grpo_trainer.py. So, to be safe, I will only modify grpo_trainer.py, which has already been tested.
Thanks, I understand the issue. However, I’m not convinced that TRL should support this kind of "polluted" dataset. It seems more appropriate for users to handle data cleaning upstream. As a general rule of thumb, if this isn’t supported in Transformers (as indicated by the error), then it probably shouldn’t be supported in TRL either. Otherwise, we risk going down a slippery slope where supporting one such case leads to an endless stream of similar edge cases. In this case the easiest is probably to map:
def clean_empty_images(example):
    for message in example["prompt"]:
        for element in message["content"]:
            if element["type"] == "text" and "image" in element:
                element.pop("image")
    return example

dataset = dataset.map(clean_empty_images)
@albertvillanova what do you think?
Thank you for your reply! Actually, the None value "pollution" is introduced precisely by dataset.map(). Check this out:
from transformers import AutoProcessor
from datasets import Dataset
model_name_or_path = "/home/ma-user/work/Downloads/Models/Qwen/Qwen3-VL-2B-Thinking"
processor = AutoProcessor.from_pretrained(model_name_or_path)
full_question = """What's on the image ? """
samples = [[
{"role": "system", "content": [{"type": "text", "text": "You are good at step by step reasoning."}]},
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
},
{"type": "text", "text": full_question},
],
},
],[
{"role": "system", "content": [{"type": "text", "text": "You are good at step by step reasoning."}]},
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
},
{"type": "text", "text": full_question,},
],
},
]]
dataset = Dataset.from_list([
{"prompt": s}
for s in samples
])
def clean_empty_images(example):
    for message in example["prompt"]:
        for element in message["content"]:
            if element["type"] == "text" and "image" in element:
                element.pop("image")
    return example

dataset1 = dataset.map(clean_empty_images)  # dataset.map() is actually the pollution source
print(dataset1[0])
"""
{'prompt': [{'content': [{'image': None, 'text': 'You are good at step by step reasoning.', 'type': 'text'}], 'role': 'system'}, {'content': [{'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg', 'text': None, 'type': 'image'}, {'image': None, 'text': "What's on the image ? ", 'type': 'text'}], 'role': 'user'}]}
"""
Indeed, as a general rule of thumb, it would be better to 1) fix this dataset.map() to not generate None value keys; or 2) fix jinja2 to get correct tokenization tolerating None value keys. However, I’m not very familiar with these two libraries and am more comfortable with TRL. Given my limited expertise, the current code modification is the best solution I can come up with to help the community address this bug. If your engineers can fix
albertvillanova left a comment
Thanks for flagging this, and for the investigation and the proposed fix, @SolarWindRider! And thanks for the ping, @qgallouedec, really appreciate it. 🤗
This is actually a known issue coming from datasets when mixed types introduce None values.
We've run into similar problems before and added a small utility (remove_none_values) to sanitize the inputs on our side, and used it for SFT and DPO:
trl/trl/trainer/dpo_trainer.py
Lines 866 to 869 in 9a29d28
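For readers unfamiliar with this kind of sanitizing step, a recursive None-stripping helper can be sketched as below. This is an illustrative stand-in written for this thread, not TRL's actual remove_none_values implementation; the name `strip_none` is hypothetical.

```python
def strip_none(value):
    """Recursively drop None-valued keys from dicts, descending into lists.

    None items that appear directly inside a list are kept; only dict keys
    whose value is None are removed.
    """
    if isinstance(value, dict):
        return {k: strip_none(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [strip_none(v) for v in value]
    return value


sample = {
    "prompt": [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "demo.jpeg", "text": None},
                {"type": "text", "text": "What's on the image?", "image": None},
            ],
        }
    ],
    "image": None,
}
print(strip_none(sample))
```

Note that applying such a helper to the whole sample also drops top-level keys whose value is None ("image" in the example), so where it is applied matters as much as what it does.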
That said, I have good news: this is now properly addressed upstream by datasets! Recent versions of datasets provide the Json feature type along with on_mixed_types="use_json" during mapping, which avoids introducing these Nones in the first place (available since datasets>=4.7.0).
Given that, it might be cleaner to rely on the upstream fix rather than maintaining workarounds on our end. I’m thinking we could pin datasets to a compatible version: I’ll open a small PR for that so we can discuss.
Good to know! I'm closing this PR.


Fix IndexError in GRPOTrainer with Multimodal Models due to None Values in Chat Content
Summary
This PR fixes a critical bug in GRPOTrainer that causes training to fail completely when using multimodal models (Qwen3-VL) where chat messages contain content blocks with None values, a common pattern when datasets are processed by automated pipelines.
The Problem
Severity: 🔴 Critical (Training Blocker)
When training with GRPOTrainer on Qwen3-VL, I encounter this cryptic error:
This error occurs deep inside transformers' processing_qwen3_vl.py:
Root Cause Analysis
The error message is misleading: it is raised in processing_qwen3_vl.py when accessing image_grid_thw[index]
The debugging journey:
Through breakpoint debugging, I traced the issue to the chat template rendering step.
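One plausible mechanism, sketched here under an assumption: if the chat template decides that a block is an image by testing key membership (something like `'image' in content`), then a text block carrying 'image': None is counted as an image too. The function below is a simplified pure-Python stand-in for such a template, not the actual Qwen3-VL template; `count_image_placeholders` is a hypothetical name.

```python
def count_image_placeholders(messages):
    """Count content blocks that a membership-based template would treat as images."""
    count = 0
    for message in messages:
        for block in message["content"]:
            # Assumed template condition: type is "image" OR the key "image" exists.
            # The key exists even when its value is None.
            if block.get("type") == "image" or "image" in block:
                count += 1
    return count


polluted = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "demo.jpeg", "text": None},
            {"type": "text", "text": "What's on the image?", "image": None},
        ],
    }
]
clean = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "demo.jpeg"},
            {"type": "text", "text": "What's on the image?"},
        ],
    }
]
print(count_image_placeholders(polluted))  # 2
print(count_image_placeholders(clean))  # 1
```

With one real image but two image placeholders in the rendered prompt, a processor that indexes per-image metadata once per placeholder would run past the end of its list, which is consistent with an IndexError on image_grid_thw[index].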
The Fix
Filter out None values from content blocks before passing to apply_chat_template():
Location: trl/trainer/grpo_trainer.py, line ~1709, in the input processing loop
This fix is minimal, surgical, and correct because it is applied exactly where prompts are processed, which minimizes its impact.
Impact
None values in optional fields (e.g., {'image': None, 'text': '...', 'type': 'text'})
Testing
Note
Medium Risk
Touches the core GRPO training/eval generation path and changes the exact kwargs passed through inputs (including env reset kwargs), which could subtly affect datasets that rely on None placeholders.
Overview
Fixes multimodal GRPO training crashes by recursively removing None values from each sample in _generate_and_score_completions before building prompts and running environment resets.
This ensures chat-template rendering/tokenization doesn’t mis-handle None content blocks (e.g., VLM image/text parts), avoiding downstream processor errors like image_grid_thw index mismatches.
Written by Cursor Bugbot for commit 02a4d60.